In [ ]:
import csv
import nltk
import math
import collections
from textblob import TextBlob
from pprint import pprint

Getting data into Python (basic Python I/O)


In [ ]:
csvfile = open('comments.csv')  # hypothetical filename -- point this at your own CSV of comments
reader = csv.reader(csvfile)
data = []
for line in reader:
    line[3] = line[3].decode('utf-8')  # column 3 holds the comment text
    data.append(line)

In [ ]:
# getting the number of rows
len(data)

In [ ]:
# taking a look at the first row
data[0]

In [ ]:
# pick one comment (the last column of a row) to play with
comment_text = data[80][-1]

Basic Python string manipulation


In [ ]:
comment_text

In [ ]:
# strings are like lists of characters
comment_text[0]

In [ ]:
# use a colon for start:end indexes
comment_text[0:100]

In [ ]:
# they can be stuck together easily
comment_text + ' ' + comment_text

In [ ]:
# and split apart
comment_text.split(' ')

In [ ]:
split_on_questions = comment_text.split('?')

In [ ]:
# it's easy to strip whitespace off of them
for string in split_on_questions:
    print string.strip()

In [ ]:
# and cast them to one case
cleaned = [string.strip().lower() for string in split_on_questions]
cleaned

In [ ]:
# join them back together
'? '.join(cleaned)

In [ ]:
# and look for substrings inside them
'clinton' in comment_text.lower()

CHALLENGE: count the number of times the words "Hilary" or "Clinton" appear in the dataset


In [ ]:
count = 0
for row in data:
    comment_text = row[-1]
    count += comment_text.count('Hilary') + comment_text.count('Clinton')

Introducing TextBlob

Like a supercharged string, with lots of NLP niceties


In [ ]:
blob = TextBlob(data[80][-1])
blob

In [ ]:
# we can get lists of sentences
blob.sentences

In [ ]:
# lists of words
blob.words

In [ ]:
# lists of "tokens" (punctuation included)
blob.tokens

In [ ]:
# even parts of speech and noun phrases
blob.tags, blob.noun_phrases

Summarizing/keywording text

How might we find representative words or phrases of a document?

A place to start: which words appear at the highest frequency in this document?


In [ ]:


In [ ]:
word_count = collections.Counter(word.lower() for word in blob.words)

In [ ]:
word_count

Challenge: get overall word counts for all comments combined

potential approaches:

  • glue together all comments into one big blob
  • get word counts for each comment individually and use Counter's update function

In [ ]:
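
One way to do the second approach, using Counter's update method (a sketch; it assumes data holds the parsed rows from above, with the comment text in the last column):

In [ ]:
overall_counts = collections.Counter()
for row in data[1:]:
    comment_blob = TextBlob(row[-1].lower())
    overall_counts.update(comment_blob.words)

overall_counts.most_common(20)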

The Problem: words we use frequently don't make good unique identifiers.

One solution: use a list of words we don't want to include

"Stop Words"


In [ ]:
stopwords = nltk.corpus.stopwords.words('english')

In [ ]:
# if the cell above raises a LookupError, run this and download the "stopwords" corpus
nltk.download()

In [ ]:
# throw out the stopwords
for key in word_count.keys():
    if key in stopwords:
        del word_count[key]

In [ ]:

We could continue adding stopwords to make these keywords better, but it's kind of like playing whack-a-mole.

An additional solution to The Problem: add a new term to our "representative-ness" measure that accounts for the overall rarity of the word

$$\frac { { n }_{ w } }{ N } $$

where ${ n }_{ w }$ is the number of documents containing word $ w $, and $ N $ is the total number of documents.

But we want a potential keyword to have a lower score if it is common in the corpus and a higher score if it is rarer, so we flip it:

$$\frac { N }{ { n }_{ w } } $$

It's also common to take the log of this to reduce the amount of disparity between extremely common and extremely uncommon terms.

$$ \log \frac { N }{ { n }_{ w } } $$
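
As a quick sanity check with made-up counts: a word that appears in 5 of 1,000 documents scores much higher than one that appears in 900 of them.

In [ ]:
# rare words get a high score, common words a low one (made-up document counts)
print math.log(1000 / 5.0)    # appears in 5 of 1,000 documents   -> ~5.3
print math.log(1000 / 900.0)  # appears in 900 of 1,000 documents -> ~0.1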

This is called IDF, or Inverse Document Frequency. Let's calculate it for all the words in our comment dataset!


In [ ]:
N_documents = float(len(data))
word_document_counts = collections.Counter()  # number of documents each word appears in
word_idf = {}

In [ ]:
for row in data[1:]:
    blob = TextBlob(row[-1].lower())
    word_document_counts.update(set(blob.words))  # count each word at most once per document

In [ ]:
# calculate IDFs
for word, document_count in word_document_counts.iteritems():
    word_idf[word] = math.log(N_documents / document_count)

For each word $ w $ in a given document $ D $, we can multiply the term frequency $$\frac { { D }_{ w } }{ { W }_{ D } } $$

where $ { D }_{ w } $ is the number of occurrences of word $ w $ in document $ D $

and $ { W }_{ D } $ is the total number of words in document $ D $

with the word's IDF that we just calculated to get TF-IDF scores; the highest-scoring words are likely to be good representatives of that document.


In [ ]:
comment = data[80][-1]
blob = TextBlob(comment.lower())
num_words_in_comment = len(blob.words)
word_count = blob.word_counts

tf_scores = {}
for word, count in word_count.iteritems():
    if word not in stopwords and len(word) > 2:
        tf_scores[word] = count / float(num_words_in_comment)

In [ ]:
tf_idf = {}
for word, tf in tf_scores.iteritems():
    tf_idf[word] = tf * word_idf[word]

sorted(tf_idf.iteritems(), key=lambda k: k[1], reverse=True)[:5]

Note that TF-IDF can be tweaked in lots of other ways if you aren't getting good results.

It can also be done with "n-grams" (phrases that are n words long) to capture multi-word phrases like "gay rights" or "hillary clinton".
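
TextBlob can generate these directly with its ngrams() method; for example, bigrams from the comment we scored above:

In [ ]:
# two-word phrases from the same comment
blob.ngrams(n=2)[:10]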

Additional demonstrations

Boiling down words: stemming


In [ ]:
from nltk.stem.porter import PorterStemmer

In [ ]:
stemmer = PorterStemmer()
print stemmer.stem('political')
print stemmer.stem('politics')
print stemmer.stem('politician')

Seeing words in context: concordance


In [ ]:
from nltk.text import Text
tokens = TextBlob(data[80][-1]).tokens
text_object = Text(tokens)
text_object.concordance('Hilary')

Sentiment Analysis


In [ ]:
blob = TextBlob(data[41][-1])
blob

In [ ]:
# polarity runs from -1 (negative) to 1 (positive); subjectivity from 0 (objective) to 1 (subjective)
blob.sentiment

In [ ]:
blob.sentences[1].sentiment
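
The same idea can be applied across the whole dataset. A rough sketch (it assumes data is the list of rows parsed above and uses TextBlob's default analyzer):

In [ ]:
# score every comment's polarity and pull out the most negative ones
scored = []
for row in data[1:]:
    polarity = TextBlob(row[-1]).sentiment.polarity
    scored.append((polarity, row[-1]))

sorted(scored)[:3]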